Evaluating Various Tokenizers for Arabic Text Classification

Abstract

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size in a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language. Not to mention the difficulty of evaluating such techniques in practice. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers by evaluating them on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using publicly available datasets. Our experiments show that none of the tokenization techniques is the best choice overall and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphology richness of the dataset. However, some tokenization techniques are better overall compared to others on various text classification tasks.
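To make the word-versus-subword distinction in the abstract concrete, here is a minimal sketch of a toy byte-pair-encoding (BPE) style learner next to a plain word-level vocabulary. It is a generic illustration and is not one of the tokenizers introduced or evaluated in the paper; the function names `word_vocab` and `learn_bpe` and the sample corpus are hypothetical.

```python
from collections import Counter


def word_vocab(corpus):
    """Word-level vocabulary: one entry per distinct whitespace-separated word."""
    return {w for line in corpus for w in line.split()}


def learn_bpe(corpus, num_merges):
    """Toy BPE-style learner (illustrative only, not the paper's method).

    Repeatedly merges the most frequent adjacent symbol pair and returns
    the list of merges plus the resulting subword vocabulary.
    """
    # Each word starts as a tuple of characters; the count is its frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    subword_vocab = {s for symbols in words for s in symbols}
    return merges, subword_vocab


if __name__ == "__main__":
    # Hypothetical sample corpus; any Unicode text, including Arabic, works.
    corpus = ["low lower lowest", "new newer newest"]
    merges, subwords = learn_bpe(corpus, num_merges=3)
    print(sorted(word_vocab(corpus)))
    print(merges)
    print(sorted(subwords))
```

The number of merges is the knob that controls subword vocabulary size: with zero merges the vocabulary is just the character set, and each additional merge adds at most one new symbol, whereas a word-level vocabulary grows with every distinct word form in the corpus.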

Similar resources

Evaluating Text Clustering Methods for Text Classification

In this project report, I evaluate several text clustering approaches and how they can be used for text classification. The particular tasks are topic classification on the 20 Newsgroups dataset and sentiment classification on a restaurant reviews dataset. Future directions for improving the results are also discussed.

Text Summarization as Feature Selection for Arabic Text Classification

Text classification (TC), or text categorization, is the task of assigning a document to one or more predefined classes or categories. A common problem in TC is the high number of terms or features in the documents to be classified (the curse of dimensionality). This problem can be addressed by selecting the most important terms. In this study, automatic text summarization is used for feature selection. S...

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in cover media such as sound, pictures, and text. A new approach is proposed to hide a secret in Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach attempts to maximize the use of "Kashida" to hide more information in Arabic text cover media. To approach this, some algorithms have been des...

A Comparative Study on Arabic Text Classification

This paper focuses on automatic Arabic text classification. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. Many experimental results on classifying Arabic text have been published. Because these results come from different datasets, authors, and evaluation metrics, the performance of the tested classifiers cannot be compared directly. In this pape...

Arabic Text Classification Using Support Vector Machines

Text classification (TC) is the process of classifying documents into a predefined set of categories based on their content. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In this paper we apply the Support Vector Machines (SVM) model to classify Arabic text documents. The results compared with the other traditional classifiers Baye...

Journal

Journal title: Neural Processing Letters

Year: 2022

ISSN: 1573-773X, 1370-4621

DOI: https://doi.org/10.1007/s11063-022-10990-8